feat: wire hub OTel metrics + pipeline health observability (Phases 1-3)#239
feat: wire hub OTel metrics + pipeline health observability (Phases 1-3)#239ptone wants to merge 9 commits into
Conversation
The hub's dbmetrics and dispatchmetrics recorders were created with NewDisabled() — all OTel instruments recorded to no-op sinks. This wires them to a real GCP Cloud Monitoring MeterProvider during server startup, lighting up ~20 existing instruments (pg LISTEN/NOTIFY latency, notifications published/delivered/dropped, subscriber lag, dispatch lifecycle, pool stats). New package pkg/observability/hubmetrics provides NewMeterProvider() which creates a PeriodicReader (60s) with the GCP metric exporter. Metric groups (db-notify, db-pool, dispatch, hub-auth, hub-gcp) can be independently disabled via env vars using OTel View Drop(). Graceful degradation: when GCPProjectID is empty or the exporter fails, the hub logs a warning and continues with disabled recorders — identical to the previous behavior.
There was a problem hiding this comment.
Code Review
This pull request integrates OpenTelemetry metrics export to GCP Cloud Monitoring for the hub server, introducing database and dispatch metrics recorders. It adds a new hubmetrics package to manage the MeterProvider, which supports disabling specific metric groups via environment variables. The review feedback suggests two important improvements: utilizing a context with a timeout during MeterProvider shutdown to prevent potential hangs, and refactoring the TestIsGroupDisabled test to correctly use its defined table fields instead of hardcoded checks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if mpErr != nil { | ||
| log.Printf("WARNING: hub metrics export disabled: %v", mpErr) | ||
| } else { | ||
| defer mp.Shutdown(context.Background()) |
There was a problem hiding this comment.
Using context.Background() directly in mp.Shutdown can cause the application shutdown to hang indefinitely if the OpenTelemetry exporter is unable to flush metrics (e.g., due to network issues or GCP API latency). It is recommended to use a context with a reasonable timeout (e.g., 5 seconds) to ensure graceful degradation and prevent hanging container terminations.
defer func() {
shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := mp.Shutdown(shutdownCtx); err != nil {
log.Printf("WARNING: failed to shutdown meter provider: %v", err)
}
}()| func TestIsGroupDisabled(t *testing.T) { | ||
| tests := []struct { | ||
| value string | ||
| disabled bool | ||
| }{ | ||
| {"", false}, | ||
| {"true", false}, | ||
| {"1", false}, | ||
| {"false", false}, | ||
| {"0", false}, | ||
| } | ||
|
|
||
| for _, tc := range tests { | ||
| t.Run(tc.value, func(t *testing.T) { | ||
| envVar := "SCION_METRICS_TEST_GROUP" | ||
| if tc.value != "" { | ||
| os.Setenv(envVar, tc.value) | ||
| t.Cleanup(func() { os.Unsetenv(envVar) }) | ||
| } else { | ||
| os.Unsetenv(envVar) | ||
| } | ||
| got := isGroupDisabled(envVar) | ||
| // Re-check directly since we test the env var we set | ||
| if tc.value == "false" || tc.value == "0" { | ||
| if !got { | ||
| t.Errorf("isGroupDisabled(%q=%q) = false, want true", envVar, tc.value) | ||
| } | ||
| } else { | ||
| if got { | ||
| t.Errorf("isGroupDisabled(%q=%q) = true, want false", envVar, tc.value) | ||
| } | ||
| } | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
The table-driven test TestIsGroupDisabled defines a disabled field in its test cases but completely ignores it during execution. Instead, it hardcodes the expected outcomes based on tc.value == "false" || tc.value == "0". Additionally, the test cases for "false" and "0" incorrectly set disabled to false in the struct definition. Refactoring the test to use the disabled field directly makes the test cases accurate and the test logic cleaner.
func TestIsGroupDisabled(t *testing.T) {
tests := []struct {
value string
disabled bool
}{
{"", false},
{"true", false},
{"1", false},
{"false", true},
{"0", true},
}
for _, tc := range tests {
t.Run(tc.value, func(t *testing.T) {
envVar := "SCION_METRICS_TEST_GROUP"
if tc.value != "" {
os.Setenv(envVar, tc.value)
t.Cleanup(func() { os.Unsetenv(envVar) })
} else {
os.Unsetenv(envVar)
}
got := isGroupDisabled(envVar)
if got != tc.disabled {
t.Errorf("isGroupDisabled(%q=%q) = %t, want %t", envVar, tc.value, got, tc.disabled)
}
})
}
}Use the table's 'want' field directly instead of re-deriving the expected value in the assertion body.
Prevents indefinite hang if the GCP metric exporter is unresponsive during server shutdown.
Add OTelMetricsRecorder and OTelGCPTokenMetrics that implement the existing MetricsRecorder and new GCPTokenMetricsRecorder interfaces using OTel instruments for Cloud Monitoring export. Both use a dual-write pattern — OTel instruments for cloud export plus embedded atomic structs for the /api/metrics JSON snapshot endpoint. New metrics: scion.hub.auth.*, scion.hub.registration.count, scion.hub.join.*, scion.hub.rotation.count, scion.hub.dispatch.*, scion.hub.brokers.connected, scion.hub.gcp.token.*, scion.hub.gcp.iam.duration. Closes #240
M1: Add operation attribute to RecordDispatch OTel instruments so dispatch failures can be broken down by operation type. M2: Expand hub-auth metric group to cover all broker lifecycle instruments (registration, join, rotation, brokers, dispatch) — not just scion.hub.auth.*. M3: Gate SetMetrics(otelMetrics) on broker auth being enabled, preventing /api/metrics from showing an all-zeros broker block when auth is disabled.
… logging Phase 3 of metrics-delivery: add observability to the agent-side telemetry pipeline so we can confirm it's working end-to-end in production. - Add scion.telemetry.pipeline.status gauge (Int64, value=1) that self-reports via the cloud exporter on a 60s ticker, confirming the pipeline is alive - Add scion.telemetry.export.errors counter with signal and error_type attributes, incrementing on metric/span/log export failures - Add classifyError() to bucket errors into timeout/auth/quota/other - Upgrade credential logging at pipeline startup from log.Info format strings to structured slog.Info with credentials_file, source, project_id, provider, and cloud_configured fields - Add structured slog.Warn when cloud export is not configured, including the env var and well-known path that were checked - Fix slog handler in pkg/sciontool/log to render key=value attributes instead of silently dropping them - Add pipeline_health_test.go with tests for health gauge lifecycle, nil-safe export error recording, and classifyError bucketing
…trics initSelfMetrics() creates providers via NewProviders() but only needs the MeterProvider. The TracerProvider and LoggerProvider were never shut down, leaking goroutines. Shut them down immediately after creation.
- Replace os.Setenv+cleanup with t.Setenv in hubmetrics tests - Wrap standalone os.Unsetenv calls with error checks - Handle Shutdown errors on TracerProvider, LoggerProvider, MeterProvider - Check pipeline.Stop error in test cleanup - Convert switch to tagged form (staticcheck QF1002) - Fix MeterProvider leak on early return in startHealthGauge
…metrics) Clarify the two distinct metric families in Scion: - Infrastructure metrics (scion.hub.*, scion.db.*, scion.dispatch.*) for platform health, produced by the Hub process - Agent metrics (gen_ai.*, agent.*) for harness/model telemetry, produced inside agent containers via the telemetry pipeline Also defines the Telemetry pipeline term.
Summary
Phase 1 — Wire Hub OTel Recorders
hubmetrics.NewMeterProvider()with GCP Cloud Monitoring exporterdbmetricsanddispatchmetricsrecorders inserver_foreground.goPhase 2 — Bridge Hub In-Memory Counters to OTel
OTelMetricsRecorderimplementingMetricsRecorderwith OTel instruments + atomic snapshotsOTelGCPTokenMetricswrapping GCP token methods with OTel instrumentsserver_foreground.gowith graceful fallbackPhase 3 — Pipeline Health Observability
scion.telemetry.pipeline.statusgauge that self-reports via cloud exporter every 60sscion.telemetry.export.errorscounter with signal/error_type attributesCloses #241
Test plan
/api/metricsJSON endpoint backward compatiblescion.telemetry.pipeline.statusappears in Cloud Monitoring